1. INTRODUCTION

At the Stanford Research Institute, the research team under Charles Rosen developed a robot called “Shakey”. Shakey wheeled around the room, observed the scene with its television eyes and, even in unfamiliar surroundings, responded to the environment to a certain extent. In 1960, Joseph Engelberger, who is known as the father of robotics, developed an industrial robot by modifying the Devol robot. Since then, the field of robotics has grown and developed rapidly. In the 21st century, automation in various industries is facilitated by industrial robots, which keep improving owing to advances in artificial intelligence technology.

Many artificial intelligence (AI) events take place across the globe, and one such event is RoboCup. At the International Joint Conference on Artificial Intelligence (IJCAI-95) held in Montreal, Canada, an official announcement was made about organizing the first robot world soccer games and conferences in association with IJCAI-97 in Nagoya, giving teams two years of preparation time. The first RoboCup, held in 1997, was a huge success: more than 40 teams participated, and many researchers identified it as a great research platform. The objective of RoboCup is to focus research and development on all major AI tasks: perception, control, navigation, strategy and planning. Accurate identification and detection of the targets is crucial for the robot players to perform all five of these AI tasks in the robot soccer environment. The computer vision of the robot players forms the basis of every AI task that the robot has to perform in the soccer environment, so object detection and recognition is a key factor that must be as accurate as possible to improve game play. Feature extraction from an image is the basic technology of machine vision; it involves extracting certain key information for subsequent recognition and classification, and it is now also used in real sports to improve playing skills. Along with well-developed machine vision, the robot player needs time coordination and collision avoidance strategies, as discussed by Tabasso et al. [1]. The development and ongoing research in the field of intelligent soccer-playing robots is aimed at bringing industrial robots to the next level of intelligence and at the development of autonomous cars [2], which involve object detection, obstacle avoidance and tracking, just like soccer-playing robots.

Although an enormous amount of research on AI and deep learning for automation is being carried out and tested on the RoboCup platform, the traditional object detection and tracking algorithms have limitations such as slow recognition and tracking, low accuracy and the need for human intervention during operation. This article proposes an improved approach to object detection and tracking that needs no human intervention and is based on AI coupled with deep learning; the method improves the accuracy and the speed of the playing robot. The model proposed in this article is called ROBOSOCCER. It convolves the image layers and classifies three classes: Nao, ball and goal-post. For the tracking stage, a Kalman filter is employed. The robustness, speed and accuracy of the proposed approach are evident from the simulation results.


2. ROBOCUP AND OBJECT DETECTION

This section briefly reviews the trends in the development of object detection and tracking algorithms. A mobile robot needs to identify various objects depending on the application in which it is deployed. On the robosoccer platform, there are two categories of moving object, the Nao and the ball, and one static object, the goal-post.


2.1. The development trend in RoboCup on object detection and tracking

Menashe et al. [3] and Cruz et al. [4] proposed ball detection based on machine learning. Speck et al. [5] proposed an approach that uses neural networks for real-time classification and detection on a humanoid Nao robot. Apart from deep learning methods, there are approaches that depend on colour classification; they work on the idea that the entire soccer environment is green, so that objects can be recognized by finding the gaps, as proposed by Hoffmann et al. [6] and Lenser and Veloso [7]. Leiva et al. [8] and Dijk and Scheunemann [9] proposed neural network based semantic segmentation that combines robot vision techniques on grayscale images with convolutional neural network (CNN) classification. Object detection and tracking in the robosoccer environment is a complex problem, as it depends on various factors such as shadows, reflections, obstructions, vanishing points and high-activity points, and there are major challenges such as occlusions, speed, multiple scales and limited data. The vision sensors of the robot players are used to capture the environment, but due to their mobility the captured data is imprecise. Because of this uncertainty in the captured data, identifying objects is harder, and hence improvement is necessary.

The robot soccer league moved from an orange ball to a black and white ball starting with the 2016 league. The work by Menashe et al. [3] detects the ball without knowing its position, using a series of heuristic region-of-interest identification techniques and supervised machine learning methods. Kukleva et al. [10] proposed a CNN approach for detecting and tracking the robot soccer ball that uses the history of frames instead of only the current frame to detect the ball. Loncomilla and Solar [11] proposed YoloSPoc, which uses maximal activation convolutions descriptors and in which good-quality object proposals are generated by YoloV3. Poppinga and Laue [12] proposed the just enough time (JET-NET) convolutional neural network framework, which performs at its best in real-time robot detection in the robosoccer environment. JET-NET does only player detection, using transfer learning. Szemenyei and Castro [13] proposed a new architecture for object detection in the robot soccer environment, which again uses transfer learning and is reported to outperform tiny-yolo in terms of speed and accuracy.

Teimouri et al. [14] proposed a new method for detecting the soccer ball on low-cost humanoid robots, which achieves an accuracy of up to 97.17%; the entire ball is extracted by a recursive algorithm using key image-based features, after which a lightweight CNN is applied. Houliston and Chalup [15] proposed a geometric input transformation called visual mesh to generate a plot in visual space, which reduces computational complexity by standardizing the pixel and object feature density. This work was an enhancement of CNNs for object detection on resource-constrained robots. Szemenyei and Castro [16] proposed an end-to-end neural network for Nao vision in the robot soccer environment, with two key neural networks, one for segmentation and the other for propagation. Szemenyei and team trained the models on dummy datasets and later tuned them on real images from the Nao robot. Leiva et al. [8] proposed a framework that detects the ball and other Nao players, their orientation and key features of the field without using any colour information: all processing is done on grayscale images, and a cascade methodology combines classical approaches with modern CNN-based classifiers. Felbinger et al. [17] proposed a genetic algorithm approach that optimizes the CNN hyperparameters with a minimum-size dataset, resulting in cost-effective inference on the Nao robot.


2.2. The development trend in CNN on object detection 

LeCun et al. [18] established the basics of image classification using convolutional neural networks. Even though LeNet had shown that CNNs could work remarkably well in image classification, there was no major development afterwards because of limited computing power and data availability, and it was believed that CNNs could be used only for digit identification. For complex recognition tasks such as face detection or object detection, a Haar cascade or a SIFT feature extractor with an SVM classifier was used. Krizhevsky et al. [19] proposed an approach that used the multi-layer CNN concept from LeNet and reached an accuracy of 84.7%. In their work, the size of the CNN was increased, the rectified linear unit (ReLU) non-linearity was used to reduce the computation cost, and the amount of training data (ImageNet) was also drastically larger.

Rethage et al. [20], representing the Visual Geometry Group, proposed the VGG network, which is now used as the backbone network for most computer vision models, such as fully convolutional networks (FCN) for semantic segmentation and Faster R-CNN for object detection. VGG reached an excellent result of 93.2%. Szegedy et al. [21] focused on reducing the computation cost and the vanishing gradient problem rather than on performance improvement alone, as VGG and GoogLeNet did. Ioffe and Szegedy [22] introduced the concept of batch normalization, which uses mini-batches to approximate the statistics of the entire dataset and speeds up training. They also introduced the learnable scale and shift parameters, which let the network normalize each layer on its own.

He et al. [23] proposed the ResNet model, in which the gradient flow in the network is better irrespective of the number of layers used. The residual model learns the difference between the input and the output instead of having each layer fit the full feature mapping, which gives better results with less information. Chollet [24] introduced the Xception network, which outperforms both ResNet and InceptionV3 and became an open-source tool for deep learning approaches. The author suggested that in convolutional neural networks the cross-channel correlations and the spatial correlations can be decoupled. Harjoseputro et al. [25] introduced MobileNet, which is similar to Xception but uses depth-wise convolution with a minimum number of parameters and high efficiency.

Zoph et al. [26] proposed searching for an architectural building block on a small dataset and then transferring the block to a larger dataset. Their work introduced a new search space, called the neural architecture search network (NASNet) search space, which enables transferability, along with a new regularization technique called scheduled drop path that significantly improves generalization in NASNet models. Tan and Le [27] proposed a new scaling method that uniformly scales all dimensions of depth, width and resolution using an effective compound coefficient. Neural architecture search is used to design a new baseline network, which is then scaled up to obtain a family of models called EfficientNets that achieve higher accuracy and efficiency. Sergey et al. [28] proposed deep learning-based object detection and position estimation, and the potential applicability of the developed framework was demonstrated on an experimental robot-manipulation setup realizing a simplified object pick-and-place scenario. Kanthi et al. [29] proposed a multi-scale 3D convolutional neural network for hyperspectral image classification. There are many other CNN works on image recognition that are well known in their own areas, such as the CNNs proposed by Murugan et al. [30] and Cui et al. [31] for underwater object detection.


3. RESEARCH METHOD 

The proposed approach includes generating the dataset, deploying the CNN to classify the predefined classes, and tracking the ball and the other robot players in the environment. This section explains the CNN architecture and the algorithms used in the research. The tracking algorithm, which is crucial for the game, is also explained.


3.1. Dataset 

No suitable dataset was available for this work. The SPQR team had released an open dataset for research purposes, but it did not meet our needs. So, the dataset was created by taking screenshots of robosoccer videos according to the frame rate; many open online converters are available for this purpose. Figure 1 shows some sample images from the generated dataset. Images were generated at random and annotated with the three classes that the work is concerned with: Nao, ball and goal-post. Annotation is the process of labelling all the images in the dataset, and for this work it was done using a graphical tool called labelImg. The dataset includes some images of environments under natural and artificial illumination at the same time, which subjects some areas to high contrast and varying brightness. Blurred and occluded images were also included for better training, more efficient identification and to limit over-fitting.

Figure 1. Random images from the dataset generated
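As an illustration of the annotation step, the sketch below converts one pixel-space bounding box into a normalized text label of the kind commonly used to train YOLO-style detectors. The paper does not state which export format was used, so the YOLO text format, the helper name `to_yolo_line` and the example image size are assumptions; the class indices follow the numbering given in the caption of Figure 4 (0: Nao, 1: ball, 2: goal-post).

```python
# Class indices as listed in the Figure 4 caption of this paper.
CLASS_IDS = {"nao": 0, "ball": 1, "goal-post": 2}


def to_yolo_line(label, box, image_size):
    """Convert a pixel box (xmin, ymin, xmax, ymax) to a YOLO-style label line.

    The output is "class cx cy w h" with the centre and size normalized
    by the image width and height (hypothetical helper, not from the paper).
    """
    xmin, ymin, xmax, ymax = box
    width, height = image_size
    cx = (xmin + xmax) / 2 / width    # normalized box centre, x
    cy = (ymin + ymax) / 2 / height   # normalized box centre, y
    w = (xmax - xmin) / width         # normalized box width
    h = (ymax - ymin) / height        # normalized box height
    return f"{CLASS_IDS[label]} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"


if __name__ == "__main__":
    # A ball annotated in a 640x480 screenshot (illustrative numbers only).
    print(to_yolo_line("ball", (100, 200, 300, 400), (640, 480)))
```

Normalized coordinates keep the labels valid when the screenshots are resized to the network input resolution.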


3.2. CNN architecture and training 

The CNN architecture proposed in this work is called the ROBOSOCCER model. We modified the CSPDarknet-53 network, which is used as the backbone of the proposed object detector. A feature pyramid network (FPN) is used as the neck and a fully connected layer as the head; in addition, spatial pyramid pooling (SPP) layers are used to remove the fixed-size constraint of the network. Since we are concerned with detecting only three classes (Nao, ball and goal-post), a small network with fewer convolution layers is sufficient. The speed of object detection increases because fewer layers mean fewer learnable parameters and less computation. The reduced number of layers may cause detection errors, but since the number of classes in this work is very small, the experimental results show that ROBOSOCCER is an efficient model for detecting the Nao, the ball and the goal-post. The proposed ROBOSOCCER model has 50 layers, and several changes to the original model were introduced to optimize its performance in the robot soccer environment. The original CSPDarknet-53 Yolo performs best when trained on a complex dataset such as COCO, which contains 80 object classes. Since this work is comparatively simple, detecting only three classes, ROBOSOCCER uses strided convolution for downscaling, and max-pooling layers are replaced with strided convolution because Sabour et al. [32] argued that max-pooling layers discard spatial information. In theory, the aggressive downscaling and the increase in stride should decrease the accuracy of the model, but the replacement of max pooling balanced this loss. Owing to the limited capability of the softmax function [33], the rectified linear unit (ReLU) activation function is used in the proposed ROBOSOCCER model.
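The effect of replacing max pooling with strided convolution can be checked with the standard output-size formula for a convolution or pooling layer, floor((n + 2p - k) / s) + 1. The sketch below is only illustrative: the 416-pixel input size, the 3×3 kernel with padding 1 and the channel counts are assumptions, since the paper does not list the layer configuration.

```python
def conv_output_size(size, kernel, stride, padding=0):
    """Spatial output size along one axis: floor((n + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1


def downscale_chain(size, steps, kernel=3, stride=2, padding=1):
    """Feature-map sizes after repeated stride-2 convolutions."""
    sizes = [size]
    for _ in range(steps):
        sizes.append(conv_output_size(sizes[-1], kernel, stride, padding))
    return sizes


def conv_param_count(kernel, in_channels, out_channels):
    """Learnable weights plus biases of one convolution layer."""
    return kernel * kernel * in_channels * out_channels + out_channels


if __name__ == "__main__":
    # A 2x2 max pool with stride 2 and a 3x3 stride-2 convolution (padding 1)
    # both halve the resolution of an assumed 416-pixel input.
    print(conv_output_size(416, kernel=2, stride=2))             # max pooling
    print(conv_output_size(416, kernel=3, stride=2, padding=1))  # strided conv
    print(downscale_chain(416, steps=5))
    print(conv_param_count(3, 64, 128))
```

Both layers give the same downscaling factor, but the strided convolution has learnable weights, whereas max pooling has none and simply keeps the largest activation in each window.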

To minimize the loss function, the stochastic gradient descent with momentum (SGDM) optimization algorithm is used for the proposed ROBOSOCCER model. A mini-batch of 600 sample images from the training dataset is used for each update of the network parameters. A learning rate of 0.001 was selected experimentally, and the CNN was trained using our ball-Nao-goal-post dataset, divided in a ratio of 90:10 into training and test sets. Training can be continued either until the network converges according to a loss function threshold or for a predefined number of iterations. In this work, the process was repeated for 6000 iterations, which corresponds to 193 epochs. A test mAP@50 of 95.18% and a training mAP@50 of 96.33% were achieved, with an average error of 4.065%, which is expected to reduce further with more iterations.
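The SGDM parameter update used during training can be sketched as follows. The learning rate of 0.001 is the value reported above; the momentum coefficient of 0.9 and the toy quadratic loss are assumptions chosen only for illustration, and the function name `sgdm_step` is hypothetical.

```python
def sgdm_step(weights, grads, velocity, lr=0.001, momentum=0.9):
    """One SGD-with-momentum update.

    v <- momentum * v - lr * grad
    w <- w + v
    """
    new_velocity = [momentum * v - lr * g for v, g in zip(velocity, grads)]
    new_weights = [w + v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity


if __name__ == "__main__":
    # Toy example: minimize f(w) = w^2, whose gradient is 2w.
    weights, velocity = [1.0], [0.0]
    for _ in range(3):
        grads = [2.0 * w for w in weights]
        weights, velocity = sgdm_step(weights, grads, velocity)
        print(weights, velocity)
```

The velocity term accumulates past gradients, so consecutive updates in the same direction grow larger than with plain gradient descent at the same learning rate.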


3.3. Feature extraction 

Feature extraction is a very important part of computer vision. It is usually done using techniques such as principal component analysis (PCA), independent component analysis (ICA), linear discriminant analysis (LDA), locally linear embedding (LLE), t-distributed stochastic neighbour embedding (t-SNE) and the autoencoder (AE). Nowadays in computer vision, the deep CNN itself is used for feature extraction and classification. Image enhancement through the features extracted by a CNN is better than with traditional methods, and for the proposed work Darknet-53 is used for feature extraction. In a CNN, the image is convolved with filters to extract various invariant features, which are passed to the next layer. In the next layer, the input features are convolved with new filters and further invariant features are extracted, as shown in Figure 2. The process continues until the final features, that is, the output of the feature extraction CNN, are obtained.

Figure 2. Feature extraction through CNN
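The filtering step described in this section can be illustrated with a minimal pure-Python sketch. It applies one hand-written 2×2 filter to a tiny image, followed by ReLU; the image, the filter values and the function names are illustrative assumptions, whereas in a trained CNN the filter weights are learned. As in most deep learning frameworks, the operation is implemented as cross-correlation (the filter is not flipped).

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image without padding (stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(
                image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh)
                for dj in range(kw)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def relu(feature_map):
    """Rectified linear unit applied element-wise."""
    return [[max(0, value) for value in row] for row in feature_map]


if __name__ == "__main__":
    # 4x4 image: dark left half, bright right half (a vertical edge).
    image = [[0, 0, 1, 1] for _ in range(4)]
    # Filter that responds to a dark-to-bright transition from left to right.
    vertical_edge = [[-1, 1], [-1, 1]]
    for row in relu(conv2d_valid(image, vertical_edge)):
        print(row)
```

The resulting feature map is non-zero only at the column where the edge lies; stacking further layers on such maps gives the progressively more abstract features shown in Figure 2.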


3.4. Tracking algorithm 

In the proposed work, the image capture or live input of the Nao robot is not always frontal: depending on the position of the robot, the front camera might not capture the image as the detector needs it, and the detector often fails. However, the three classes (the Nao, the ball and the goal-post) still appear as blobs in the captures of the Nao front camera. Yang [34] proposed moving video image target tracking and recognition based on CNN feature selection, in which feature centers are generated from the distance matrix between feature images and the feature dimensions are compressed. The literature contains many studies on face detection and tracking algorithms. Mliki and Hammami [35] introduced a fully automatic approach to face detection and tracking with estimation in video sequences. In their work, they propose combining detection and tracking to overcome the various challenging problems that might occur while detecting or tracking faces.

The ROBOSOCCER model and the Kalman filter are combined for tracking the ball and the Nao in the robot soccer environment, as depicted in Figure 3. First, the ROBOSOCCER model is used to detect the ball/Nao and its position in the first frame. Then the next frame is given to the model and the new position is obtained [36]. The old measurement is then updated with the new measurement. After the measurement update, the a posteriori state estimate and the a posteriori estimate error covariance are used in the time update to predict the a priori state estimate and the a priori estimate error covariance for the next frame. The motion state variable of the Kalman filter is defined as s = {x, y, a, b}, where x and y are the coordinates of the object and a and b are the velocities of the object in the two directions. The state at each frame k and the measurement are governed by (1) and (2), respectively [37], [38]. The prediction noise and the measurement noise are represented as n(k) and m(k), with covariances N and M. X is the n×n matrix that relates the previous state to the a priori estimate of the present state, Y is the n×l matrix that relates the control input to the state s, and H is the observation matrix.


$s_k = X s_{k-1} + Y u_{k-1} + n_{k-1}$ (1)

$z_k = H s_k + m_k$ (2)

where $u_{k-1}$ denotes the control input. There are two updates:
- The time update: the a priori state estimate is given in (3) and the a priori estimate error covariance in (4),

$\hat{s}_k^- = X \hat{s}_{k-1} + Y u_{k-1}$ (3)

$P_k^- = X P_{k-1} X^T + N$ (4)

- The measurement update: the Kalman gain,

$K_k = P_k^- H^T (H P_k^- H^T + M)^{-1}$ (5)

the a posteriori state estimate from the measurement $z_k$,

$\hat{s}_k = \hat{s}_k^- + K_k (z_k - H \hat{s}_k^-)$ (6)

and the a posteriori estimate error covariance:

$P_k = (I - K_k H) P_k^-$ (7)
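The time and measurement updates in (3) to (7) can be sketched in plain Python. This is a simplified illustration, not the implementation used in the paper: it assumes a constant-velocity model with no control input, measures only the box centre, and treats the two image axes as independent two-state filters (position and velocity), which is equivalent to the four-state filter when the noise covariances are block diagonal. The noise values `q` and `r`, the initial covariance `p0` and all names are assumptions.

```python
class AxisKalman:
    """Constant-velocity Kalman filter for one image axis."""

    def __init__(self, position, velocity=0.0, dt=1.0, q=1e-3, r=1.0, p0=100.0):
        self.s = [position, velocity]        # state estimate [position, velocity]
        self.P = [[p0, 0.0], [0.0, p0]]      # estimate error covariance
        self.dt, self.q, self.r = dt, q, r   # frame step, process noise, measurement noise

    def predict(self):
        """Time update, (3) and (4), with X = [[1, dt], [0, 1]] and N = q * I."""
        dt = self.dt
        (p00, p01), (p10, p11) = self.P
        self.s = [self.s[0] + dt * self.s[1], self.s[1]]
        self.P = [
            [p00 + dt * (p10 + p01) + dt * dt * p11 + self.q, p01 + dt * p11],
            [p10 + dt * p11, p11 + self.q],
        ]
        return self.s[0]

    def update(self, z):
        """Measurement update, (5) to (7), with H = [1, 0] and M = r."""
        (p00, p01), (p10, p11) = self.P
        innovation_cov = p00 + self.r                          # H P H^T + M (scalar)
        k0, k1 = p00 / innovation_cov, p10 / innovation_cov    # Kalman gain
        residual = z - self.s[0]                               # z_k - H s_k^-
        self.s = [self.s[0] + k0 * residual, self.s[1] + k1 * residual]
        self.P = [
            [(1 - k0) * p00, (1 - k0) * p01],
            [p10 - k1 * p00, p11 - k1 * p01],
        ]
        return self.s[0]


def track(centres):
    """Filter detected box centres; None marks a frame with no detection."""
    fx, fy = AxisKalman(centres[0][0]), AxisKalman(centres[0][1])
    path = [centres[0]]
    for centre in centres[1:]:
        x, y = fx.predict(), fy.predict()
        if centre is not None:
            x, y = fx.update(centre[0]), fy.update(centre[1])
        path.append((x, y))
    return path


if __name__ == "__main__":
    # Ball moving 2 px right and 1 px up per frame; detection missed at frame 15.
    centres = [(10 + 2 * k, 50 - k) for k in range(30)]
    centres[15] = None
    print(track(centres)[15], track(centres)[-1])
```

When the detector returns nothing for a frame, the sketch keeps the predicted position, which is how the filter can bridge frames in which the detector fails.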


3.5. Novelties in approach 

In this work, a novel ROBOSOCCER CNN model is proposed by modifying the CSPDarknet CNN. The model is trained to classify the Nao, ball and goal-post classes using transfer learning. Since no suitable open datasets are available for this kind of study, a new annotated dataset was created. Figure 3 demonstrates the tracking algorithm.

In the proposed work, a Kalman filter with time update and measurement update is used to track the ball and the Nao. The work explores the performance measure intersection over union (IoU) for the detection and tracking of the three classes. The proposed tracking approach does not need any object with a bounding box in the first frame; instead, it performs the detection itself and then does the tracking. The proposed model achieves an overall speed of 1.67 FPS, which will support the robosoccer application.


[Figure 3 flowchart: initialization (k = 1); (n-1)th frame measurements (a priori estimation); send the nth frame to the ROBOSOCCER model; measurement update; get the nth frame measurements (a posteriori estimation); time update (k = k + 1)]

Figure 3. Tracking algorithm


4. RESULTS AND DISCUSSIONS 

In this work, an Nvidia Tesla P100-PCIE GPU and 25 GB of RAM were used for training and testing the different models to arrive at the proposed ROBOSOCCER model. The proposed model was evaluated at different IoU thresholds. The IoU threshold is the minimum permissible overlap between the ground truth and the predicted anchor boxes for a detection to be counted as correct. When the threshold is set to a small value, the model is evaluated leniently in terms of localization. The problem with this evaluation method is that even very small localization errors make the IoU fall drastically, which leads it to ignore the classification or confidence error. To balance this, mean average precision (mAP) values were computed on the validation dataset for different error measures, such as the Euclidean distance between the anchor box centers. The results in Table 1 show that the proposed ROBOSOCCER model performs well and achieves higher detection. Many tests were run; increasing the retraining of the initial layers made the model learn the dataset better, which resulted in faster convergence. A smaller learning rate was used in this retraining process. The precision-recall (PR) curves are shown in Figure 4. The PR curve for the proposed model, in Figure 4(a), shows that the model performance is good enough for use. The PR curve was also drawn for each of the three classes, as in Figures 4(b)-(d). The proposed model is compared with existing methods in Table 2. Figure 5 shows how the mAP and the IoU change during the training process. The mAP metric used for evaluating the proposed model focuses on the rigidity of the detected object rather than on errors in the anchor box shape.
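For reference, the IoU between a predicted box and a ground-truth box can be computed as in the sketch below. Boxes are assumed to be given as (xmin, ymin, xmax, ymax) in pixels, and the function names are hypothetical; the example shows how a shift of only two pixels on a small 10×10 object already pushes the IoU below a threshold of 0.5.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (xmin, ymin, xmax, ymax)."""
    inter_w = max(0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def is_correct_detection(predicted, ground_truth, threshold=0.5):
    """A detection counts as correct when its IoU reaches the threshold."""
    return iou(predicted, ground_truth) >= threshold


if __name__ == "__main__":
    ground_truth = (0, 0, 10, 10)            # small object, e.g. a distant ball
    for shift in (0, 1, 2):
        predicted = (shift, shift, 10 + shift, 10 + shift)
        print(shift, round(iou(predicted, ground_truth), 4),
              is_correct_detection(predicted, ground_truth))
```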
Table 1. Evaluation metrics

IoU threshold   Precision   Avg IoU (in %)   F1-score   mAP (in %)
0.5             0.94        79.76            0.95       95.18
0.55            0.92        79.41            0.94       92.80
0.6             0.88        78.14            0.92       88.81
0.65            0.82        73.35            0.88       80.06
0.7             0.76        71.67            0.82       72.45
0.75            0.70        67.50            0.76       67.56

Figure 4. Precision-recall curve: (a) PR curve of the ROBOSOCCER model, (b) PR curve of class 2: goal-post, (c) PR curve of class 1: ball, and (d) PR curve of class 0: Nao
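The average precision of one class is the area under its precision-recall curve, and the mAP is the mean of these values over the classes. The sketch below computes it with all-point interpolation; the paper does not state which interpolation scheme was used, so this choice, the function names and the toy detections are assumptions.

```python
def average_precision(scored_hits, num_ground_truth):
    """Area under the PR curve for one class.

    scored_hits: list of (confidence, is_true_positive) for every detection,
    where a true positive is a detection matched to a ground-truth box at
    the chosen IoU threshold (for example 0.5 for mAP@50).
    """
    ranked = sorted(scored_hits, key=lambda hit: hit[0], reverse=True)
    true_pos = false_pos = 0
    precisions, recalls = [], []
    for _, is_hit in ranked:
        if is_hit:
            true_pos += 1
        else:
            false_pos += 1
        precisions.append(true_pos / (true_pos + false_pos))
        recalls.append(true_pos / num_ground_truth)
    # Interpolation: make precision non-increasing as recall grows.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    area, previous_recall = 0.0, 0.0
    for precision, recall in zip(precisions, recalls):
        area += (recall - previous_recall) * precision
        previous_recall = recall
    return area


def mean_average_precision(per_class_ap):
    """mAP is the mean of the per-class average precisions."""
    return sum(per_class_ap) / len(per_class_ap)


if __name__ == "__main__":
    # Three detections of one class, two ground-truth objects (toy data).
    detections = [(0.9, True), (0.8, False), (0.7, True)]
    print(average_precision(detections, num_ground_truth=2))
```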


Table 2. Comparison with existing object detectors in the robot soccer environment

Model                       Reference                   mAP      Focused Class of Identification
MobileNet SSD v1            Yang [34]                   68.81%   Small Sized League
Neural approach using CNN   Cruz et al. [4]             83%      Only Ball
JET-Net                     Loncomilla and Solar [11]   85%      Only Nao
ROBO                        Poppinga and Laue [12]      81.38%   Nao, Ball and Goal-post
BallNet                     Szemenyei and Castro [13]   89.72%   Only Ball
ROBOSOCCER (Proposed)                                   96.18%   Nao, Ball and Goal-Post

Figure 5. mAP and IoU during the training

5. CONCLUSION

In this work, a novel ROBOSOCCER CNN model is proposed by modifying the CSPDarknet CNN. The model is trained to classify the Nao, ball and goal-post classes using transfer learning. Since no suitable open datasets are available for this kind of study, a new annotated dataset was created. A Kalman filter with time update and measurement update is used to track the ball and the Nao. The work explores the performance measure intersection over union (IoU) for the detection and tracking of the three classes. The proposed tracking approach does not need any object with a bounding box in the first frame; instead, it performs the detection itself and then does the tracking. The proposed model achieves an overall speed of 1.67 FPS, which will support the robosoccer application.